19 research outputs found

    Fast thread communication and synchronization mechanisms for a scalable single chip multiprocessor

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.Includes bibliographical references (p. 159-163).by Stephen William Keckler.Ph.D

    The M-Machine Multicomputer

    Get PDF
    The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by the constraints of modern semiconductor technology and the demands of programming systems. The M- Machine computing nodes are connected with a 3-D mesh network; each node is a multithreaded processor incorporating 12 function units, on-chip cache, and local memory. The multiple function units are used to exploit both instruction-level and thread-level parallelism. A user accessible message passing system yields fast communication and synchronization between nodes. Rapid access to remote memory is provided transparently to the user with a combination of hardware and software mechanisms. This paper presents the architecture of the M-Machine and describes how its mechanisms maximize both single thread performance and overall system throughput

    Universal Mechanisms for Data-Parallel Architectures

    No full text
    Data-parallel programs are both growing in importance and increasing in diversity, resulting in specialized processors targeted at specific classes of these programs. This paper presents a classification scheme for data-parallel program attributes, and proposes micro-architectural mechanisms to support applications with diverse behavior using a single reconfigurable architecture. We focus on the following four broad kinds of data-parallel programs --- DSP/multimedia, scientific, networking, and real-time graphics workloads. While all of these programs exhibit high computational intensity, coarse-grain regular control behavior, and some regular memory access behavior, they show wide variance in the computation requirements, fine grain control behavior, and the frequency of other types of memory accesses. Based on this study of application attributes, this paper proposes a set of general micro-architectural mechanisms that enable a baseline architecture to be dynamically tailored to the demands of a particular application. These mechanisms provide efficient execution across a spectrum of data-parallel applications and can be applied to diverse architectures ranging from vector cores to conventional superscalar cores. Our results using a baseline TRIPS processor show that the configurability of the architecture to the application demands provides harmonic mean performance improvement of 5%--55% over scalable yet less flexible architectures, and performs competitively against specialized architectures

    Architecting an Energy-Efficient DRAM System for GPUs

    No full text
    This paper proposes an energy-efficient, high-throughput DRAM architecture for GPUs and throughput processors. In these systems, requests from thousands of concurrent threads compete for a limited number of DRAM row buffers. As a result, only a fraction of the data fetched into a row buffer is used, leading to significant energy overheads. Our proposed DRAM architecture exploits the hierarchical organization of a DRAM bank to reduce the minimum row activation granularity. To avoid significant incremental area with this approach, we must partition the DRAM datapath into a number of semi-independent subchannels. These narrow subchannels increase data toggling energy which we mitigate using a static data reordering scheme designed to lower the toggle rate. This design has 35% lower energy consumption than a die-stacked DRAM with 2.6% area overhead. The resulting architecture, when augmented with an improved memory access protocol, can support parallel operations across the semi-independent subchannels, thereby improving system performance by 13% on average for a range of workloads.1

    Efficient, Protected Message Interface in the MIT M-Machine

    No full text
    We present a user-level message interface that provides high performance and very low processor overhead. In this system, messages are launched from within the user's general register file, and received in a hardware queue mapped to a general register. A message handler is started within the latency of a jump instruction upon arrival of the first message word, up to 18\Theta faster than conventional interruptdriven interfaces. These tightly-integrated mechanisms feature end-to-end latencies as low as nearly one quarter that of memory-mapped interfaces. Copying is eliminated in these mechanisms, reducing processor occupancy by up to one-third when compared with other integrated register-mapped systems. The interface also features a low overhead, robust protection model using virtually-addressed message destinations and certified trusted handlers. Both are specified with unforgeable pointers, enabling finegrain control over a user thread's accessible domain as well as its permitted remo..
    corecore